Project 3: Data Analysis with R

Udacity Nanodegree - November Cohort

Loan Data from Prosper Exploration by Michael Strobl

Here you can download the dataset: Loan Data from Prosper and Explanations to the Dataset
Prosper is a platform where individuals can invest in personal loans or request to borrow money.

Here is a YouTube video from Fox Business in which the CEO of Prosper explains the Prosper system.

Dataset: Read in Dataset and Libraries:

loan <- read.csv('prosperLoanData.csv')
library(ggplot2)
library(gridExtra)
library(lubridate)
library(RColorBrewer)


Note: The dataset ‘prosperLoanData.csv’ must be in the same folder as this ‘Project_3.Rmd’ file.

1 Univariate Plots Section

In the following, 18 variables of the dataset are plotted and described. They are divided into numeric and categorical data. The numeric data is also described with an R summary command.

Numeric Data


Variable 1: Loan Original Amount


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The loan amounts range from 1000 to 35000 USD. Most loans fall between 5000 and 15000 USD. You can also see peaks at exactly 5000, 10000, 15000, 20000 and 25000 USD.

Variable 2: Borrower Annual Percentage Rate (APR)


Note: To keep outliers out of the plot, the data was restricted to the range between the 0.1% and the 99.9% quantile.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

Most Borrower APRs are between 15% and 30%, but the single highest frequency is at rates around 36%.

Compared to the normal curve, the distribution of the APRs looks quite similar, except for the high frequency of APRs around 36%.
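The plotting code is not shown in this report, so the following is only a minimal sketch of how a quantile restriction like the one in the note could be done in base R. The vector `apr` is a made-up stand-in for `loan$BorrowerAPR`, and the names `limits` and `apr_trimmed` are hypothetical.

```r
# Hypothetical stand-in for loan$BorrowerAPR; the values are made up.
set.seed(1)
apr <- c(rnorm(1000, mean = 0.22, sd = 0.08), NA, 0.0001, 0.95)

# Cut-offs at the 0.1% and 99.9% quantiles, ignoring missing values.
limits <- quantile(apr, probs = c(0.001, 0.999), na.rm = TRUE)

# Keep only the observations inside the cut-offs (NAs are dropped as well).
apr_trimmed <- apr[!is.na(apr) & apr >= limits[1] & apr <= limits[2]]

summary(apr_trimmed)
```

Only the plotted data is restricted this way; the summary tables in this section still show the full range of each variable.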

Variable 3: Borrower Rate


Note: To keep outliers out of the plot, the data was restricted to the range between the 0.1% and the 99.9% quantile.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

The highest frequency of rates is around 32%, followed by rates around 15% and 20%.


Compared to the normal curve, the distribution of rates looks quite similar, except for the high frequency of rates around 32%.

Variable 4: Lender Yield


Lender Yield is the Borrower Rate less Service Fees.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

The Lender Yield can be negative. The highest frequency is around 30%, followed by 16%, 15% and 22%.

Compared to the normal curve, the distribution of yields looks quite similar, except for the high frequency of yields around 30%.


Variable 5: Investors


Note: To keep outliers out of the plot, the data was restricted to the range between the 0% and the 99% quantile.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

The distribution seems to be long-tailed. Most loans have 1 investor, and the number of loans decreases as the number of investors grows.

Variable 6: Monthly Loan Payment


Note: To keep outliers out of the plot, the data was restricted to the range between the 0% and the 99% quantile.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2252.0

Most monthly loan payments are around 200 USD.

Variable 7: Stated Monthly Income


Note: To keep outliers out of the plot, the data was limited to incomes from 0 to 20000 USD.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

Most borrowers have an income between 3000 and 6000 USD.

Variable 8: Debt To Income Ratio


Note: To keep outliers out of the plot, the data was limited to ratios from 0 to 1 (0% to 100%).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Most borrowers have a Debt to Income Ratio of 10% to 30%.

Variable 9: Average Credit Score


The Average Credit Score is the midpoint between the upper and the lower bound of the credit score range provided by a consumer credit rating agency.

#AverageCreditScore
loan$AverageCreditScore <- (loan$CreditScoreRangeLower+loan$CreditScoreRangeUpper)/2


Note: To keep outliers out of the plot, the data was restricted to the range between the 1% and the 99% quantile.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

Most loans have a score between 650 and 750 points.

Variable 10: Loan Per Investor



#LoanPerInvestor
Investors2 <- subset(loan, Investors > 1)
Investors2$LoanPerInvestor <- Investors2$LoanOriginalAmount/Investors2$Investors



##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     7.299    48.190    70.000   220.200   125.000 12500.000

Most investors put 20 to 100 USD into a loan. 50 USD is the absolute peak.

Variable 11: Debt Service Coverage Ratio/Debt Coverage Ratio


Explanation of the Debt Coverage Ratio
Here it is calculated as the ratio of Stated Monthly Income to Monthly Loan Payment.

#Debt Coverage Ratio
loan$DebtCoverageRatio2 <- loan$StatedMonthlyIncome/loan$MonthlyLoanPayment


Note: To keep outliers out of the plot, the data was limited to ratios from 0 to 100.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      13      20     Inf      35     Inf      15

Most borrowers have a Debt Coverage Ratio around 20. That means their income is 20 times their monthly loan payment. The higher the value, the more trustworthy the borrower.

Categorical Data


Variable 12: Term



The loans have three possible terms: 12, 36, 60 months. 77% of the loans have a term of 36 months.
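A share like the one above can be read off a frequency table. This is a small sketch with a made-up vector `term` standing in for `loan$Term`, so the percentages it prints are not the ones from the dataset.

```r
# Hypothetical toy vector standing in for loan$Term (months); counts are made up.
term <- factor(c(rep(12, 2), rep(36, 15), rep(60, 3)))

# Absolute counts per term, then shares in percent.
term_counts <- table(term)
term_share <- round(100 * prop.table(term_counts), 1)
term_share
```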

Variable 13: Loan Status



Most loans are current, followed by completed, chargedoff and defaulted.

Variable 14: Employment Status



Most loans are given to employed, full-time and self-employed people.

Variable 15: Loan Categories


Note: To avoid overplotting, the dataset was reduced to the top 10 Loan Categories.

The most frequent loan categories that are defined (not NA or Other) are debt consolidation, home improvements, business and auto.
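Reducing a categorical variable to its most frequent levels could be sketched as follows. The helper `top_n_levels` and the toy vector `category` are hypothetical and only illustrate the idea; the report itself uses the top 10.

```r
# Hypothetical toy vector standing in for the ListingCategory column.
category <- c("Debt Consolidation", "Debt Consolidation", "Debt Consolidation",
              "Business", "Business", "Auto", "Boat", NA)

# Count each category (table() ignores NA by default), sort descending
# and keep the names of the n most frequent ones.
top_n_levels <- function(x, n = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

top2 <- top_n_levels(category, n = 2)

# Keep only the observations that belong to the most frequent categories.
category_top <- category[category %in% top2]
table(category_top)
```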

Variable 16: Home Owner



Borrowers who own a home slightly outnumber those who do not.

Variable 17: Top Occupations


Note: To avoid overplotting, the dataset was reduced to the top 10 Occupations.

The occupations with the highest frequency are administrative assistants, analysts and accountants/CPAs.


Variable 18: Year



The Year variable was created with the lubridate library.
Most loans were given in 2013, and there was a sharp drop in the number of loans in 2009.
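The text does not say which date column the Year variable was derived from, so the date strings below are hypothetical. The sketch uses base R so that it runs without extra packages; with lubridate loaded, `year(as.Date(dates))` would be expected to give the same values.

```r
# Hypothetical date strings; which date column was used is not stated in the text.
dates <- c("2007-08-26 19:09:29", "2013-03-03 00:00:00", "2013-11-15 08:30:00")

# Parse the date part (trailing characters are ignored), then extract the year.
year <- as.integer(format(as.Date(dates, format = "%Y-%m-%d"), "%Y"))

# Number of loans per year.
table(year)
```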



What is the structure of your dataset?

Size of Dataset: 86.5 MB
Variables: 81
Number of Loans in the Dataset: about 114000
Volume of all Loans: about 950 Million
Number of Investors: about 9.2 Million
Average Invest: USD 103
Average Loan: USD 8300
Minimum Loan: USD 1000
Maximum Loan: USD 35000
Terms: 12, 36 or 60 months
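Figures like these can be derived with a few base R calls. The sketch below assumes that the number of investors is the sum of the Investors column over all loans and that the average invest is the total volume divided by that sum; the data frame `toy` is a made-up stand-in for the real data.

```r
# Hypothetical three-loan stand-in for the Prosper data frame; values are made up.
toy <- data.frame(LoanOriginalAmount = c(1000, 5000, 10000),
                  Investors = c(1, 40, 119))

n_loans      <- nrow(toy)                     # number of loans
total_volume <- sum(toy$LoanOriginalAmount)   # volume of all loans
n_investors  <- sum(toy$Investors)            # investor counts summed over all loans
avg_invest   <- total_volume / n_investors    # average amount per investment
avg_loan     <- mean(toy$LoanOriginalAmount)  # average loan amount

c(loans = n_loans, volume = total_volume, investors = n_investors,
  avg_invest = avg_invest, avg_loan = avg_loan)
```

Under this reading, an investor who funds several loans is counted once per loan.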

What are the main features of interest in your dataset?

I want to see:
- How high are the loans people take on the Prosper platform (main feature: Loan Original Amount)?
- How expensive are the loans (main feature: Borrower Rate)?

What other features in the dataset do you think will help support your investigation into your feature of interest?


- Who are the people who take the loans?
- What do people take the loans for?
- How many investors does a loan have?
- Is there a rating for the loans?

Did you create any new variables from existing variables in the dataset?

Average Credit Score
Loan Per Investor
Debt Coverage Ratio

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

BorrowerRate, BorrowerAPR and LenderYield are close to a normal distribution, and Investors seems to have a long-tailed distribution. To tidy the data, I converted the variable “Term” to a factor and changed empty values to “NA”. ListingCategory..numeric. was recoded into ListingCategory, where each numeric code is replaced by its meaning, for example: “1” becomes “Debt Consolidation”.
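The tidying steps could look roughly like this. It is a sketch on made-up vectors, not the original code. Only the label for code 1 is given in the text; the labels used for codes 2 and 3 are assumptions and should be checked against the variable explanations.

```r
# Hypothetical codes standing in for loan$ListingCategory..numeric.
codes <- c(1, 3, 1, 2, NA)

# Only "1" = "Debt Consolidation" is given in the text; the labels for 2 and 3
# are assumptions and should be checked against the variable dictionary.
listing_category <- factor(codes,
                           levels = c(1, 2, 3),
                           labels = c("Debt Consolidation",
                                      "Home Improvement",
                                      "Business"))

# Term converted to a factor with one level per term length.
term <- factor(c(36, 60, 36, 12, 36))

# Empty strings turned into NA (hypothetical occupation values).
occupation <- c("Analyst", "", "Attorney", "", "Other")
occupation[occupation == ""] <- NA
```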

2 Bivariate Analysis


BorrowerRate vs BorrowerAPR vs LenderYield



Correlation:
a) BorrowerRate & BorrowerAPR

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and BorrowerAPR
## t = 2347.699, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9897057 0.9899409
## sample estimates:
##      cor 
## 0.989824


b) BorrowerRate & LenderYield

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LenderYield
## t = 8493.938, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9992021 0.9992204
## sample estimates:
##       cor 
## 0.9992113


c) BorrowerAPR & LenderYield

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerAPR and LenderYield
## t = 2291.732, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9892049 0.9894515
## sample estimates:
##       cor 
## 0.9893289

Conclusion: The distributions of all three features seem quite similar. Also, all three features have correlations of almost 1. That’s why the following analysis is reduced to one feature, Borrower Rate.
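Output like the one shown above comes from `cor.test()`, which uses Pearson's product-moment correlation by default. The following sketch runs it on made-up vectors; the relations between `rate`, `apr` and `yield` are invented for illustration and do not reproduce the numbers from the dataset.

```r
# Hypothetical stand-ins for BorrowerRate, BorrowerAPR and LenderYield.
set.seed(42)
rate  <- runif(200, min = 0.05, max = 0.35)
apr   <- rate + 0.02 + rnorm(200, sd = 0.005)   # made-up relation to the rate
yield <- rate - 0.01 + rnorm(200, sd = 0.001)   # made-up fee plus small noise

# cor.test() defaults to Pearson's product-moment correlation.
res_apr   <- cor.test(rate, apr)
res_yield <- cor.test(rate, yield)

res_apr$estimate   # correlation coefficient
res_apr$conf.int   # 95 percent confidence interval
res_apr$p.value    # p-value of the test
```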

1. Loan Original Amount

1.1 Borrower Rate vs Loan Original Amount


## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LoanOriginalAmount
## t = -117.5822, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3341283 -0.3237719
## sample estimates:
##        cor 
## -0.3289599

Note: The red dot shows the mean of the plotted feature(s)
Most loans have an amount of 5000, 10000, 15000 or 20000 USD, and the Borrower Rate is widely spread between 5% and 35%. The value of -0.33 indicates a weak negative correlation between Loan Original Amount and Borrower Rate.

1.2 Investors vs Loan Original Amount

## 
##  Pearson's product-moment correlation
## 
## data:  Investors and LoanOriginalAmount
## t = 138.7077, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3751140 0.3850494
## sample estimates:
##       cor 
## 0.3800926


Note: The red dot shows the mean of the plotted feature(s)
The loans around 25000 USD have the most investors, but there are also many high loans with only 1 investor.
The value of 0.38 indicates a weak positive correlation between Investors and Loan Original Amount.

In the following, only loans with more than 1 investor are analysed.

## 
##  Pearson's product-moment correlation
## 
## data:  Investors and LoanOriginalAmount
## t = 263.0167, df = 86121, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6636999 0.6711074
## sample estimates:
##       cor 
## 0.6674202

The value rises from 0.38 to 0.67 and indicates a fairly strong positive correlation between Loan Original Amount and Investors.


1.3 Monthly Loan Payment vs Loan Original Amount

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and MonthlyLoanPayment
## t = 867.8179, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9312165 0.9327426
## sample estimates:
##       cor 
## 0.9319837

You can see three different lines, which represent the three different terms. The higher the loan, the higher the Monthly Loan Payment.
The Pearson correlation test confirms this: with a value of 0.93, the correlation is very strong.

1.4 Average Credit Score vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
There is a huge spread between credit scores of 500 and 800. Loans over 10000 USD mostly have a credit score over 700. Loans under 10000 USD are mostly granted with a credit score of 500 or higher.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and AverageCreditScore
## t = 122.0719, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3357190 0.3460095
## sample estimates:
##       cor 
## 0.3408745

A value of 0.34 shows a weak positive correlation between Average Credit Score and Loan Original Amount.


1.5 Stated Monthly Income vs Loan Original Amount



People mostly have higher loans when they have a higher income. But there are also exceptions (see bottom right).

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 69.3527, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1956816 0.2068243
## sample estimates:
##       cor 
## 0.2012595

A value of 0.20 shows a weak positive correlation between Stated Monthly Income and Loan Original Amount.


1.6 Debt To Income Ratio vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
It seems that the Debt To Income Ratio has no visible effect on the Loan Original Amount.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtToIncomeRatio
## t = 3.2828, df = 105381, p-value = 0.001028
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.004074882 0.016148830
## sample estimates:
##        cor 
## 0.01011222

The value of 0.01 confirms this and shows almost no correlation between the two features.


1.7 Loan Per Investor vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Most loans between 0 and 25000 USD are made up of single investments of 50 to 100 USD.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanPerInvestor and LoanOriginalAmount
## t = 35.555, df = 86121, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1136895 0.1268536
## sample estimates:
##       cor 
## 0.1202769

A value of 0.12 shows a weak positive correlation between these two features.


1.8 Debt Coverage Ratio vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Top left shows high loans with low Debt Coverage Ratio while bottom right shows low loans with high Debt Coverage Ratio.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtCoverageRatio2
## t = NaN, df = 113920, p-value = NA
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  NaN NaN
## sample estimates:
## cor 
## NaN

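The correlation is NaN because DebtCoverageRatio2 contains infinite values (loans with a Monthly Loan Payment of 0), which is also why its summary shows a mean and a maximum of Inf.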
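A division by a payment of 0 produces Inf, and a single Inf is enough to turn the correlation into NaN. The sketch below reproduces this on made-up values and shows one possible way around it, restricting the test to finite ratios. This was not part of the original analysis, and the vectors are hypothetical.

```r
# Hypothetical toy values; a payment of 0 makes the ratio infinite.
income  <- c(3000, 4500, 6000, 5200, 2500)
payment <- c(150, 300, 0, 260, 100)
amount  <- c(4000, 9000, 15000, 8000, 2500)

ratio <- income / payment        # third element is Inf
r_all <- cor(amount, ratio)      # not a number: the Inf propagates through the means

# Keep only rows where the ratio is finite (drops Inf, -Inf, NaN and NA).
ok <- is.finite(ratio)
r_finite <- cor.test(amount[ok], ratio[ok])$estimate
r_finite
```

Whether loans with a payment of 0 should be excluded or handled differently depends on what they represent in the data.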


1.9 Term vs Loan Original Amount


Note: The red dot shows the mean of the plotted feature(s)
The longer the term, the higher the Loan Original Amount.
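The plotting code is not shown, so this is only a sketch of how per-group means such as those marked by the red dots could be computed, here with `tapply()` on made-up values standing in for `loan$LoanOriginalAmount` and `loan$Term`.

```r
# Hypothetical toy data standing in for loan$LoanOriginalAmount and loan$Term.
amount <- c(2000, 4000, 6000, 8000, 10000, 14000)
term   <- factor(c(12, 12, 36, 36, 60, 60))

# Mean loan amount per term: one value per group.
mean_by_term <- tapply(amount, term, mean)
mean_by_term
```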

1.10 Loan Status vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Current loans have the highest loan amounts while cancelled loans have the lowest.

1.11 Employment Status vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Employed and full-time borrowers have the highest loans while loans to retired and not-employed borrowers are rarely over 10000 USD.

1.12 Loan Categories vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Debt consolidation, business and baby & adoption loans have the highest amounts.

1.13 Homeowner vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Homeowners have higher amounts than non-homeowners.

1.14 Top Occupations vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Attorneys have the highest loans and Bus Drivers the lowest.

1.15 Year vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
The years 2013 and 2014 have the highest loan amounts while 2008 has the lowest.


2. BorrowerRate


2.1 Investors vs Borrower Rate


Note: The red dot shows the mean of the plotted feature(s)
Most loans have rates between 5% and 35%, while the loans with the most investors have rates of 10% to 20%.

2.2 Monthly Loan Payment vs Borrower Rate


Note: The red dot shows the mean of the plotted feature(s)
Most payments are between 0 and 1000 USD and have rates between 5% and 35%.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and MonthlyLoanPayment
## t = -85.2021, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2501933 -0.2392759
## sample estimates:
##        cor 
## -0.2447424

A value of -0.24 shows a weak negative correlation between Borrower Rate and Monthly Loan Payment.

2.3 Average Credit Score vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
With a higher score, the interest rates seem to fall.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and AverageCreditScore
## t = -175.1695, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

A value of -0.46 shows a moderate negative correlation between Borrower Rate and Average Credit Score.

2.4 Stated Monthly Income vs Borrower Rate



The Monthly Income doesn’t seem to have a big effect on the Borrower Rate.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and StatedMonthlyIncome
## t = -30.1548, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09473938 -0.08321827
## sample estimates:
##        cor 
## -0.0889818

A value of -0.09 shows a very weak negative correlation between Stated Monthly Income and Borrower Rate.


2.5 Debt To Income Ratio vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
Looking at the red dots, a Debt To Income Ratio from 0 to 0.25 goes with Borrower Rates mostly from 5% to 25%, while higher Debt To Income Ratios go with an almost constant Borrower Rate of 25%.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and DebtToIncomeRatio
## t = 20.4649, df = 105381, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05690080 0.06892819
## sample estimates:
##        cor 
## 0.06291678

The value of 0.06 shows a very weak positive correlation between these two features.


2.6 Loan Per Investor vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
It seems that Loan per Investor has no visible effect on the Borrower Rate.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanPerInvestor and BorrowerRate
## t = 20.6332, df = 86121, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06348697 0.07677859
## sample estimates:
##        cor 
## 0.07013589

The value of 0.07 shows a very weak positive correlation between these two features.


2.7 Debt Coverage Ratio vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
Across all Debt Coverage Ratios, the Borrower Rates seem to be spread evenly between 5% and 35%.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and DebtCoverageRatio2
## t = NaN, df = 113920, p-value = NA
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  NaN NaN
## sample estimates:
## cor 
## NaN

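The correlation is NaN here as well, because DebtCoverageRatio2 contains infinite values (loans with a Monthly Loan Payment of 0).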


2.8 Term vs Borrower Rate


Note: The red dot shows the mean of the plotted feature(s)
Terms of 36 or 60 months have higher rates than ones of 12 months.

2.9 Loan Status vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
Current and cancelled loans have the lowest borrower rates while past due, defaulted and chargedoff loans have the highest.

2.10 Employment Status vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
Not-employed and Other have the highest rates while the other categories seem to have rates around 20%.

2.11 Loan Categories vs Borrower Rates



Note: The red dot shows the mean of the plotted feature(s)
Auto and Other have higher rates than the others. Boat and Debt Consolidation have the lowest.

2.12 Homeowner vs Borrower Rates



Note: The red dot shows the mean of the plotted feature(s)
Homeowners have lower rates than non-homeowners.

2.13 Top Occupations vs Borrower Rates



Note: The red dot shows the mean of the plotted feature(s)
Administrative Assistants and Bus Drivers have the highest interest rates while Attorneys and Architects have the lowest.

2.14 Year vs Borrower Rates



Note: The red dot shows the mean of the plotted feature(s)
The interest rates are below 20% from 2005 to 2009, above 20% from 2010 to 2012 and again below 20% in 2013 and in 2014.

3. Additional bivariate Plots


3.1 Investors vs Monthly Loan Payment


## 
##  Pearson's product-moment correlation
## 
## data:  Investors and MonthlyLoanPayment
## t = 141.8441, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3824632 0.3923333
## sample estimates:
##       cor 
## 0.3874093

Note: The red dot shows the mean of the plotted feature(s)
Most loans have Monthly Loan Payments between 0 and 1000 USD and between 0 and 300 investors. The red dots show the means of each Investors-Monthly Loan Payment pair. According to the Pearson correlation, there is a weak positive relationship between the two features.

3.2 Investors vs Average Credit Score


## 
##  Pearson's product-moment correlation
## 
## data:  Investors and AverageCreditScore
## t = 94.9155, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2659485 0.2767345
## sample estimates:
##     cor 
## 0.27135

Note: The red dot shows the mean of the plotted feature(s)

Most loans have Average Credit Scores between 700 and 800 points and between 0 and 500 investors. There is a weak positive correlation between Investors and Average Credit Score.

3.3 Monthly Loan Payment vs Average Credit Score


## 
##  Pearson's product-moment correlation
## 
## data:  MonthlyLoanPayment and AverageCreditScore
## t = 102.9909, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2871995 0.2978465
## sample estimates:
##      cor 
## 0.292532

Note: The red dot shows the mean of the plotted feature(s)
Most Monthly Loan Payments belong to loans with scores between 600 and 700. No clear trend is visible in the plot, although the Pearson correlation of 0.29 indicates a weak positive relationship.

3.4 Term vs Investors



Note: The red dot shows the mean of the plotted feature(s)
Loans with a term of 12 and 36 months have more investors than 60 month loans.

3.5 Loan Categories vs Investors



Note: The red dot shows the mean of the plotted feature(s)
Business and Personal Loans have the most investors while Boat, Baby & Adoption and Other have the fewest.

3.6 Year vs Monthly Loan Payment



Note: The red dot shows the mean of the plotted feature(s)
Monthly Loan Payments have risen over the years, except for a drop around 2009.

3.7 Top Occupations vs Monthly Loan Payment



Note: The red dot shows the mean of the plotted feature(s)
Attorneys have the highest monthly loan payments while Bus Drivers and Administrative Assistants have the lowest.

3.8 Employment Status vs Monthly Loan Payment



Note: The red dot shows the mean of the plotted feature(s)
Employed and Self Employed People have higher monthly loan payments than retired, part-time or not employed People.

3.9 Homeowner vs Average Credit Score



Note: The red dot shows the mean of the plotted feature(s)
Homeowners have higher Average Credit Scores than non-homeowners.

3.10 Loan Status vs Average Credit Score



Note: The red dot shows the mean of the plotted feature(s)
Current, Completed and Final Payment in Progress Loans have higher Scores than Cancelled or Defaulted Loans.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The two main features are Loan Original Amount and Borrower Rate. First, Loan Original Amount:
The features Investors (1.2), Average Credit Score (1.4) and Term (1.9) are positively related to the Loan Original Amount. Higher amounts go along with more investors, higher Average Credit Scores or longer terms.
When you look at the feature Homeowner (1.13), you can see that people with homes have higher amounts than people without homes. This makes sense because they can offer more security.
The feature Occupation (1.14) is also interesting. Highly educated occupations like Attorneys and Chemists get higher loans than Bus Drivers or Administrative Assistants.
Finally, the feature Year (1.15): Over time, people get higher loans. The Prosper platform seems to have become more trusted over the last 10 years, except in 2009.

Second, Borrower Rate:
Most features, like Stated Monthly Income (2.4) or Term (2.8), have no direct effect on the Borrower Rate: the rates stay between 5% and 35% regardless. But other features have visible effects.
First, Average Credit Score (2.3): The higher the score, the lower the interest rate.
Secondly, Homeowner (2.12): Homeowners have lower rates than non-homeowners.
Thirdly, Occupation: Attorneys have lower rates than Bus Drivers (see Loan Original Amount above).
Finally, Year (2.14): Over time, the rates have decreased, apart from the disruption around 2008/2009, which was probably caused by the financial crisis.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Monthly Loan Payment vs Year (3.6) is interesting. It’s like Borrower Rate vs Year and Loan Original Amount vs Year (see above). Over time, people are paying more for their loans taken on the Prosper platform. Perhaps they now use this platform more than other institutions like banks.
Occupation vs Monthly Loan Payment (3.7): You can see that the Occupation has an effect on the Monthly Loan Payment. Attorneys and Architects can afford higher payments than the others.
Homeowners vs Average Credit Score (3.9): Homeowners get higher scores than non-homeowners. This makes sense because you need a high score in the first place to get a loan for a house.

What was the strongest relationship you found?


The strongest relationships were those between Borrower Rate, APR and Lender Yield at the beginning, with correlations of almost 1. Next comes the correlation of 0.93 between Monthly Loan Payment and Loan Original Amount.


3 Multivariate Analysis


1. Loan Original Amount by Monthly Loan Payment and Term



This plot shows Loan Original Amount vs Monthly Loan Payment by Term. You can see three different lines, one for each of the three terms. They reflect the almost perfect correlation between Loan Original Amount and Monthly Loan Payment.
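One way to see why each term forms its own line is the standard annuity formula. This is a sketch under the assumption of fixed-rate loans repaid in equal monthly instalments; the text does not state how Prosper computes the payments, and the function `monthly_payment` is hypothetical. For a fixed rate and term, the payment is proportional to the amount, and the slope depends on the term.

```r
# Standard annuity formula; not taken from the original analysis.
# amount: loan amount, annual_rate: yearly interest rate, months: loan term.
monthly_payment <- function(amount, annual_rate, months) {
  r <- annual_rate / 12                      # monthly interest rate
  amount * r / (1 - (1 + r)^(-months))
}

# Same rate, three terms: payment per 1 USD borrowed, one slope per term.
terms <- c(12, 36, 60)
slopes <- sapply(terms, function(m) monthly_payment(1, 0.184, m))
names(slopes) <- terms
slopes
```

In the real data the rate differs from loan to loan, so each line is a band rather than an exact straight line.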

2. Loan Original Amount by Borrower Rate and Homeowners



Homeowners (left) seem to have higher loans with lower rates than non-homeowners (right). One explanation could be the security the home offers.

3. Borrower Rate by Investors and Homeowners



Non-homeowners (left) have fewer investors while homeowners (bottom right) have more investors and mostly rates below 25%.

4. Loan Original Amount by Stated Monthly Income and Year



There are higher loans with higher monthly incomes in the years from 2011 to 2014 than in the other years.

5. Loan Original Amount by Average Credit Score and Employment Status



Employed and full-time people have the highest score and highest loan amounts while people with “not available” labelled loans have the lowest score and the lowest loan amounts.

6. Loan Original Amount by Debt Coverage Ratio and Homeowners



Homeowners have higher Debt Coverage Ratios and higher Loan Amounts (bottom right and top left) than non-homeowners. That means they are more creditworthy.

7. Loan Original Amount by Loan Per Investor and Year



The Loan Per Investor and the Loan Original Amounts are higher from 2010 to 2014 (top left and bottom right) than in the early years. This suggests that more people have come to trust the Prosper platform over time.

8. Borrower Rate by Loan Per Investor and Year



The Borrower Rate seems to stay constant between 5% and 35% while the Loan Per Investor has risen in the later years.

9. Loan Original Amount by Debt to Income Ratio and Loan Status



The Loan Status doesn’t seem to have an effect on the Debt to Income Ratio because almost all loans have ratios between 0 and 1. The amounts are higher at lower ratios between 0 and 0.5. Loans with the statuses Final Payment in Progress and Cancelled have much lower ratios than loans with the other 6 statuses.

10. Loan Original Amount by Debt To Income Ratio and Top Occupations



Most people in the top occupations have a Debt to Income Ratio between 0 and 0.5, except Administrative Assistants and Bus Drivers. These two groups also have lower Loan Amounts than the others.

11. Borrower Rate by Stated Monthly Income and Loan Categories



The loan category doesn’t seem to have a visible effect on the Borrower Rate because all categories have rates between 5% and 35%. But you can see that people who use their loans for Debt Consolidation, Business and Home Improvements have the highest monthly incomes.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

First, the focus is on the feature Homeowner as color variable:
Plot 2 shows that homeowners get better conditions than non-homeowners: higher amounts and lower interest rates.
Plot 3 shows that loans of homeowners have more investors than loans of non-homeowners.
Plot 6 shows that homeowners have higher Debt Coverage Ratios and higher loans than non-homeowners.
Secondly, the feature Year is interesting:
Plot 8 shows that Loan Per Investor has increased although the Borrower Rates seem to have stayed constant.


Were there any interesting or surprising interactions between features?

Plot 10 is interesting. All of the 10 Occupations have similar Debt To Income Ratios, mostly between 0% and 50%. I expected that people with higher incomes have lower Debt To Income Ratios but all people seem to have similar ratios of debt/income.




4 Final Plots and Summary


1. Loan Original Amount by Debt Coverage Ratio and Year




You can see that the Debt Coverage Ratio changed over time. In the early years 2005-2009 (bottom left), the amounts are medium-sized and the coverage ratio is low. There are also high Debt Coverage Ratios with low amounts (bottom right). In the later years 2010-2014, the amounts and the ratios have risen dramatically (top left and middle right). More people are trusting the Prosper platform with higher amounts, and the borrowers appear more trustworthy because of the higher ratios.

2. Loan Original Amount by Stated Monthly Income and Top Occupations




In these plots you can see how different occupations get different loan amounts, backed by their incomes. There are three groups. Group 1 has top loans with top incomes, like Analysts, Accountants and Attorneys. Group 2 has top loans with lower incomes, like Civil Service, Car Dealers and Chemists. Group 3 has low loans and low incomes, like Administrative Assistants, Bus Drivers, Architects and Biologists. Group 1 is the most preferred group, which is why they get high loans. Group 2 also has a good standing because they get high loans with lower incomes. But group 3 isn’t that interesting for lenders, so they get lower loans than the other groups.

3. Loan Original Amount by Stated Monthly Income and Loan Categories




People with high incomes (right in the plots) mostly use their loans for Debt Consolidation, Home Improvements and Business. People with low incomes (left in the plots) mostly use them for the above-mentioned categories and also for Student Use, Baby & Adoption, Auto and Boats.

5 Reflection


The most difficult part of this analysis was choosing the right variables. I reduced the dataset to 15 variables and created 3 new ones. I focused my analysis on the price of the loan (Borrower Rate) and the amount of the loan (Loan Original Amount). It’s interesting that only a few features have an influence on the interest rate while the amount is much more influenced by features like Occupation, home ownership or the credit score of the borrower.
The next step for me would be an analysis of all occupations. So far I have focused only on the 10 occupations with the highest counts, but there are 68 different occupations.
Beyond that, the Prosper platform has evolved over time. More loans were given in the most recent years than in the earlier ones, and the interest rates have been decreasing, too. This suggests growing trust in the platform among its users.
The usage of the loans is also interesting. Most loans were used for Debt Consolidation and Business.
I am also a little bit surprised about the long terms (at least 12 months in this dataset) and the high rates (median 18.4%). If you borrow 1000 USD for 36 months at 18.4%, compounded annually and repaid in one sum at the end, you have to pay back about 1659.80 USD. I would look for cheaper alternatives or shorter terms.
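The arithmetic behind such a figure can be checked in a few lines. With annual compounding and a single repayment after three years, a rate of 18% gives 1643.03 USD and the median rate of 18.4% gives 1659.80 USD. The last part of the sketch assumes equal monthly instalments instead (standard annuity formula), which is an assumption and not something stated in the text.

```r
amount <- 1000
years  <- 3

# Annual compounding with a single repayment at the end.
total_18  <- amount * (1 + 0.18)^years    # 18% per year: 1643.03
total_184 <- amount * (1 + 0.184)^years   # 18.4% per year: 1659.80

# Assumption: equal monthly instalments (annuity formula) at 18.4%.
r <- 0.184 / 12
payment <- amount * r / (1 - (1 + r)^(-36))
total_instalments <- 36 * payment

round(c(total_18, total_184, total_instalments), 2)
```

With monthly instalments the total is lower, because the outstanding balance shrinks with every payment.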

References


Dataset: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv
Explanation of Variables: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0
http://en.wikipedia.org/wiki/Prosper_Marketplace
http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
https://www.youtube.com/watch?v=qz2ZV-ELVfw
http://www.r-bloggers.com/r-function-of-the-day-tapply
http://www.dummies.com/how-to/content/how-to-interpret-a-correlation-coefficient-r.html


Thank you for your attention.